Abstract

This article describes a new probabilistic method called the “class-specific method” (CSM).
CSM has the potential to avoid the “curse of dimensionality” that plagues most classifiers that
attempt to determine decision boundaries in a high-dimensional feature space. In contrast, in
CSM, it is possible to build classifiers without a common feature space. Separate low-dimensional
feature sets may be defined for each class, while the decision functions are projected back to
the common raw data space. CSM effectively extends classical classification theory to handle
multiple feature spaces. It is completely general and requires no simplifying assumptions, such as
Gaussianity or that the data lie in linear subspaces.


1 Introduction and Background 

The purpose of this article is to introduce the reader to the basic principles of classification with class- 
specific features. It is written both for readers interested in only the basic concepts as well as those 
interested in getting started in applying the method. For in-depth coverage, the reader is referred to 
a more detailed article [1]. 

Classification is the process of assigning data to one of a set of pre-determined class labels [2].
Classification is a fundamental problem that has to be solved if machines are to approximate the 
human functions of recognizing sounds, images, or other sensory inputs. This is why classification is 
so important for automation in today’s commercial and military arenas. 

Many of us have first-hand knowledge of successful automated recognition systems from cameras 
that recognize faces in airports to computers that can scan and read printed and handwritten text, 
or systems that can recognize human speech. These systems are becoming more and more reliable 
and accurate. Given reasonably clean input data, the performance is often quite good if not perfect. 
But many of these systems fail in applications where clean, uncorrupted data is not available or if the 
problem is complicated by variability of conditions or by proliferation of inputs from unknown sources. 
In military environments, the targets to be recognized are often uncooperative and hidden in clutter
and interference. In short, military uses of such systems still fall far short of what a well-trained alert
human operator can achieve. 

We are often perplexed by the wide gap of performance between humans and automated systems. 
Allow a human listener to hear two or three examples of a sound - such as a car door slamming. From 
these few examples, the human can recognize the sound again and not confuse it with similar interfering 
sounds. But try the same experiment with general-purpose classifiers using neural networks and the 
story is quite different. Depending on the problem, the automated system may require hundreds, 
thousands, even millions of examples for training before it becomes both robust and reliable. 

Why? The answer lies in what is known as the “curse of dimensionality”. General-purpose classifiers
need to extract a large number of measurements, or features, from the data to account for all
the different possibilities of data types. The large collection of features forms a high-dimensional space
that the classifier has to sub-divide with decision boundaries. It is well known that the complexity of
a high-dimensional space increases exponentially with the number of measurements [3] - and so does
the difficulty of finding the best decision boundaries from a fixed amount of training data. Unless a
lot is known about the data, allowing the features to be suitably conditioned so that the data samples
fall in nicely organized patterns in the feature space, finding the “optimum” decision boundaries in
a feature space above about 5 dimensions is futile [4]. Optimum decision boundaries require finding
the probability distributions (probability density functions or PDFs) of each class in the feature space
[2]. Sub-optimal decision boundaries, that is, those based on simplified probability models or simple search
procedures such as “nearest neighbor”, can achieve very good performance if the data from the various
classes are well separated in the feature space, but fail dramatically if there is any degradation of
training data or input data quality.

These problems can potentially be avoided if we avoid working in a high-dimensional space. But 
how can we avoid working in high dimensions if all the measurements (features) carry pertinent 
information? One way is to keep a large number of features, but divide up the features according 
to their relevance to a particular class (class-specific features) and process them separately. Many 
schemes have been invented to try to find suitable rules for combining the processors [5], [6], [7], [8], 
[9], [10]. While they are on the right track, the problem with these classifiers is that they generally are 
unable to combine the results of the individual decisions in a way that is both theoretically optimal 
and completely general at the same time. What is needed is an extension to the classical theory of 
hypothesis testing that can account for class-specific features. 

In answer to this need, the author proposed the class-specific method (CSM) in 1998 [11], [12],
[13]. The initial formulation of the method suffered from several difficulties, which were solved with
the publication of the PDF projection theorem in 2000 [14], [15]. Further enhancements of the theory
have resulted from the chain rule [16], [1], a recursive application of the PDF projection theorem. The
resulting classifier architecture, called the chain-rule processor [16], [1], blends the best aspects of signal
processing and classification. CSM is completely general and makes no assumptions about the data,
such as that it yields to linear subspace decomposition. Nor does it require any special topology such
as a binary tree of decisions. In fact, the classical feature classifier is a special case of CSM that occurs
when all classes are represented by a common feature set. But, unlike the classical classifier, CSM can
circumvent the curse of dimensionality if each class can be represented (statistically described) using
a separate low-dimensional feature set.





2 The Classical Approach 

The classical Bayesian classifier selects the most likely class hypothesis given the data according to

$$j^* = \arg\max_{j=1,\ldots,M} \; p(H_j|\mathbf{x}),$$

where $\mathbf{x}$ is the data and $\{H_1, H_2, \ldots, H_M\}$ are the $M$ class hypotheses. Using Bayes rule, this may be
written

$$j^* = \arg\max_{j} \; p(\mathbf{x}|H_j)\, p(H_j). \qquad (1)$$

This classifier has, in theory, the lowest probability of error of all classifiers [2], [17]. Unfortunately, the
likelihood functions, or probability density functions (PDFs), denoted by $p(\mathbf{x}|H_j)$, are unknown and
need to be estimated from training data. Because the dimension of the raw data is too high, $\mathbf{x}$ has to
be reduced to a set of information-bearing features using a feature transformation $\mathbf{z} = T(\mathbf{x})$. If it is
possible to find a low-dimensional feature set that contains most or all of the necessary information,
the problem can then be re-formulated in terms of $\mathbf{z}$. By regarding $\mathbf{z}$ as the data, the Bayesian feature
classifier becomes

$$j^* = \arg\max_{j} \; p(\mathbf{z}|H_j)\, p(H_j),$$

where $p(\mathbf{z}|H_j)$ are the feature PDFs estimated from training data.
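As a rough illustration of the feature classifier above, the following sketch fits a one-dimensional Gaussian to the common feature of each class and picks the class with the largest value of $\log p(\mathbf{z}|H_j) + \log p(H_j)$. The Gaussian model, the function names, and the training numbers are assumptions made for this example only; they are not taken from the article.

```python
import math

def gaussian_logpdf(z, mean, var):
    """Log of a 1-D Gaussian PDF with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - mean) ** 2 / var

def fit_gaussian(samples):
    """Estimate the mean and (biased) variance of a 1-D feature."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

def bayes_feature_classify(z, class_models, priors):
    """Return the index j that maximizes log p(z|H_j) + log p(H_j)."""
    scores = [gaussian_logpdf(z, m, v) + math.log(p)
              for (m, v), p in zip(class_models, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical training features for two classes (illustrative numbers only).
train = [[0.9, 1.1, 1.0, 1.2, 0.8], [2.9, 3.2, 3.0, 3.1, 2.8]]
models = [fit_gaussian(t) for t in train]
priors = [0.5, 0.5]

label_low = bayes_feature_classify(1.05, models, priors)
label_high = bayes_feature_classify(2.95, models, priors)
```

Note that every class is scored on the same feature $\mathbf{z}$, which is exactly the restriction that CSM removes.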

The classical approach to classification is summarized graphically in Figure 1 for two data classes. 
The original raw data space (X) is mapped to a feature space (Z) where the PDFs are estimated and 
the decision boundaries are constructed. The curse of dimensionality forces the following trade-off: 
If the feature dimension is too high, there are severe errors in PDF estimation causing classification 
errors. If the feature dimension is too low, the loss of information causes the classes to become 
overlapped in Z, also causing classification errors. There may be no feature dimension where the 
performance is acceptable. In short, the curse of dimensionality cannot be overcome. Once the raw 
data is discarded in favor of a common set of features, all hope is lost for achieving the best possible 
performance. 

3 The Class-Specific Method (CSM) 

Because this is a tutorial paper, we present only the most basic mathematical concepts of CSM. For 
further reading, the reader is referred to the most recent publications [1]. 

The classical approach loses the fight against the curse of dimensionality because it puts “all of 
its eggs in one basket”. It requires a low-dimensional feature set that contains all of the necessary 
information - an impossible request. Instead of discarding the raw data, CSM actually operates in 
the raw data domain - but it estimates the PDFs in low-dimensional feature spaces. This requires a 
two-step procedure. 

3.1 Step 1: Feature transformation and PDF estimation. 

First, the raw data is transformed into class-specific low-dimensional feature spaces. Let

$$\begin{aligned}
\mathbf{z}_1 &= T_1(\mathbf{x}) \\
\mathbf{z}_2 &= T_2(\mathbf{x}) \\
&\;\;\vdots \\
\mathbf{z}_M &= T_M(\mathbf{x})
\end{aligned}$$

be the $M$ different feature sets and feature transformations. The PDFs $p(\mathbf{z}_m|H_m)$, $1 \le m \le M$, are
then estimated from training data. This first step is illustrated in Figure 2.

Figure 1: Illustration of the classical approach to classification. The original raw data space (X) is
mapped to a feature space (Z) where the PDFs are approximated (ellipses) and decision boundaries
(green line) are constructed. Because of potential information loss, the classes can become overlapped
in Z, causing classification errors.
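Step 1 can be sketched in a few lines. Each class gets its own feature transformation and its own low-dimensional PDF estimate, fitted only to that class's training records. The two transformations (`log_energy`, `sample_mean`), the Gaussian PDF model, and the data below are hypothetical choices for illustration; the article does not prescribe them.

```python
import math
import statistics

def log_energy(x):
    """Hypothetical feature for class 1: log of the mean squared sample value."""
    return math.log(sum(v * v for v in x) / len(x))

def sample_mean(x):
    """Hypothetical feature for class 2: arithmetic mean of the samples."""
    return sum(x) / len(x)

def estimate_feature_pdf(transform, training_records):
    """Map each raw record through T_m, then fit a 1-D Gaussian (mean, variance)."""
    feats = [transform(x) for x in training_records]
    return statistics.fmean(feats), statistics.pvariance(feats)

# Illustrative raw training records (made-up numbers), one list per class.
class1_records = [[1.0, -1.0, 2.0, -2.0], [2.0, -2.0, 1.0, -1.0], [1.5, -1.5, 1.5, -1.5]]
class2_records = [[3.0, 3.2, 2.8, 3.0], [2.9, 3.1, 3.0, 3.0], [3.1, 3.1, 2.9, 3.3]]

# One (transform, fitted PDF parameters) pair per class.
models = [
    (log_energy, estimate_feature_pdf(log_energy, class1_records)),
    (sample_mean, estimate_feature_pdf(sample_mean, class2_records)),
]
```

The fitted PDFs live on different feature spaces, so they cannot be compared directly; that comparison is what Step 2 makes possible.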

3.2 Step 2: PDF projection back to raw data domain. 

Next, CSM converts the feature PDFs into raw-data PDFs. It projects the PDFs back to the raw 
data domain where the decision boundaries are constructed. CSM avoids the complexity issues of the 
raw-data space because the projection operators (functions that transform the PDFs to the raw data 
domain) are known functions that can be determined exactly from the feature transformations. This 
last step is illustrated in Figure 3. 


3.3 How the projection works. 

Projecting the PDF from the feature domain back to the raw data domain is made possible by the PDF
projection theorem [15], [14]. This theorem may be thought of as a generalization of the well-known
change-of-variables theorem, which relates the PDF of $y$ to the PDF of $x$ when they are related by the 1:1
transformation $y = f(x)$. For continuous invertible transformations, it is a simple matter to recover
the PDF of $x$ from the PDF of $y$ using the formula

$$p_x(x) = \left| \frac{\partial y}{\partial x} \right| p_y(y). \qquad (2)$$
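A quick numerical check of formula (2) may help. The sketch below assumes, purely for illustration, that $x$ is standard normal and $y = e^x$, so that $y$ is log-normal; it then recovers $p_x(x)$ from $p_y(y)$ and compares it with the known normal PDF.

```python
import math

def normal_pdf(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def lognormal_pdf(y):
    """PDF of y = exp(x) when x is standard normal."""
    return math.exp(-0.5 * math.log(y) ** 2) / (y * math.sqrt(2.0 * math.pi))

def recovered_px(x):
    """Recover p_x(x) from p_y(y) using formula (2) with y = exp(x)."""
    y = math.exp(x)
    dy_dx = math.exp(x)  # derivative of the transformation
    return abs(dy_dx) * lognormal_pdf(y)

test_points = (-2.0, -0.5, 0.0, 1.0, 2.5)
max_error = max(abs(recovered_px(x) - normal_pdf(x)) for x in test_points)
```

The two PDFs agree to rounding error at every test point, as the change-of-variables theorem requires for an invertible transformation.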


The PDF projection theorem (PPT) is a generalization of (2) for many-to-one transformations. Under
certain conditions (having to do with sufficient statistics) it is possible to actually recover $p_x(\mathbf{x})$ from
the feature PDF. In general, however, the PPT can only find a particular one of the many possible
PDFs of $\mathbf{x}$ that could have produced the given feature PDF. This particular choice has some nice
properties. The projection operation is illustrated in Figure 4. Projection can only be accomplished
if it is possible to know both the raw data PDF and feature PDF under some reference hypothesis
$H_0$. In general, it is impossible to determine the raw data PDF if all we know is the feature PDF.

Figure 2: Illustration of the first step of CSM. A separate feature transformation is designed for each
class. The feature PDFs for each class are estimated on the corresponding feature space (ellipses).

Figure 3: Illustration of the second step of CSM. The feature PDFs are projected back to the raw
data space where the decision boundaries are constructed.

Figure 4: Illustration of the PDF projection operation. The projection can be accomplished only if it is
possible to know both the raw data PDF and feature PDF for some reference hypothesis $H_0$.


The projection operation finds only an approximation to the true raw data PDF of the given class
hypothesis. The projected PDF defined on the raw data domain is given mathematically by

$$p_p(\mathbf{x}|H_j) = \left[ \frac{p(\mathbf{x}|H_{0,j})}{p(\mathbf{z}_j|H_{0,j})} \right] p(\mathbf{z}_j|H_j), \qquad (3)$$

where $H_{0,j}$ are the class-specific reference hypotheses. Thus, the partial derivative (which generalizes
to the determinant of the Jacobian matrix for multi-dimensions) in equation (2) is replaced by a
ratio of PDFs. As expected, the PDF projection theorem simplifies to (2) for continuous invertible
transformations. It may be proved [15], [14] that $p_p(\mathbf{x}|H_j)$ is a PDF, so it integrates to 1 on the raw
data space, and that it is a member of the class of PDFs which generate the original feature PDF
$p(\mathbf{z}_j|H_j)$. This means that if a random variable $\mathbf{x}$ is drawn from the PDF in equation (3), and the
result is transformed by the feature transformation $\mathbf{z}_j = T_j(\mathbf{x})$, then the PDF of $\mathbf{z}_j$ will be precisely
$p(\mathbf{z}_j|H_j)$, i.e. the projection process comes full circle.
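The claim that the projected PDF integrates to 1 can be seen from a short argument, sketched here under the assumption that $p(\mathbf{z}_j|H_{0,j}) > 0$ wherever $p(\mathbf{z}_j|H_j) > 0$; the full proof is in the cited references. The ratio $p(\mathbf{z}_j|H_j)/p(\mathbf{z}_j|H_{0,j})$ depends on $\mathbf{x}$ only through $\mathbf{z}_j = T_j(\mathbf{x})$, so its expectation under $H_{0,j}$ can be taken in the feature domain:

```latex
\begin{aligned}
\int p_p(\mathbf{x}|H_j)\, d\mathbf{x}
  &= \int p(\mathbf{x}|H_{0,j})\,
     \frac{p(T_j(\mathbf{x})|H_j)}{p(T_j(\mathbf{x})|H_{0,j})}\, d\mathbf{x}
   = E_{\mathbf{x}|H_{0,j}}\!\left[
     \frac{p(\mathbf{z}_j|H_j)}{p(\mathbf{z}_j|H_{0,j})} \right] \\
  &= \int p(\mathbf{z}_j|H_{0,j})\,
     \frac{p(\mathbf{z}_j|H_j)}{p(\mathbf{z}_j|H_{0,j})}\, d\mathbf{z}_j
   = \int p(\mathbf{z}_j|H_j)\, d\mathbf{z}_j = 1 .
\end{aligned}
```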

Various interpretations of the projection theorem can be suggested. One interpretation is that
since there are an infinite number of raw data PDFs that generate the feature PDF, it is necessary to
invoke a constraint so that one unique raw data PDF can be found. The applicable constraint is that
the likelihood ratio with respect to the reference hypothesis remains constant in either domain:

$$\frac{p_p(\mathbf{x}|H_j)}{p(\mathbf{x}|H_{0,j})} = \frac{p(\mathbf{z}_j|H_j)}{p(\mathbf{z}_j|H_{0,j})}.$$

Another interpretation is possible if we reverse the usual thinking. Normally we start with two
statistical hypotheses, then seek a sufficient statistic for differentiating between them. But we could
also start with just one hypothesis ($H_{0,j}$) and a statistic ($\mathbf{z}_j$) and ask “what would be a second
hypothesis for which $\mathbf{z}_j$ is sufficient against $H_{0,j}$?” The PDF constructed according to (3) is the
second hypothesis we seek. Thus, $\mathbf{z}_j$ is sufficient for $H_{0,j}$ vs. the hypothesis that $p_p(\mathbf{x}|H_j)$ is true.
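The projection (3) can be tried on a case where everything is known in closed form. The sketch below assumes, for illustration only, that the feature is the energy $z = \sum_i x_i^2$, that the reference hypothesis is i.i.d. unit-variance Gaussian noise (so $z$ is chi-square with $N$ degrees of freedom), and that under the class hypothesis $z/\sigma^2$ is chi-square with $N$ degrees of freedom. Because the energy is a sufficient statistic for the variance, the projected PDF should match the true PDF of i.i.d. $N(0, \sigma^2)$ samples.

```python
import math

def log_chi2_pdf(z, dof):
    """Log PDF of a chi-square variable with `dof` degrees of freedom."""
    half = 0.5 * dof
    return (half - 1.0) * math.log(z) - 0.5 * z - half * math.log(2.0) - math.lgamma(half)

def log_projected_pdf(x, sigma2):
    """Equation (3) in log form for the feature z = sum(x_i^2).

    Reference hypothesis H0: x is i.i.d. N(0, 1), so z is chi-square(N).
    Assumed feature PDF under H_j: z / sigma2 is chi-square(N).
    """
    n = len(x)
    z = sum(v * v for v in x)
    log_px_h0 = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * z
    log_pz_h0 = log_chi2_pdf(z, n)
    log_pz_hj = log_chi2_pdf(z / sigma2, n) - math.log(sigma2)
    return log_px_h0 - log_pz_h0 + log_pz_hj

def log_true_pdf(x, sigma2):
    """Log PDF of i.i.d. N(0, sigma2) samples, for comparison."""
    n = len(x)
    z = sum(v * v for v in x)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - 0.5 * z / sigma2

x = [0.3, -1.7, 2.2, 0.9, -0.4]
gap = abs(log_projected_pdf(x, 4.0) - log_true_pdf(x, 4.0))
```

The gap is zero up to rounding, which is the sufficient-statistic case in which the projection recovers the true raw data PDF exactly.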


3.4 How to choose the reference hypothesis. 

A detailed mathematical treatment of the issues surrounding the reference hypothesis is given in [1].
Briefly, the condition that $H_{0,j}$ must satisfy for the projection (3) to result in a valid PDF is that
$p(\mathbf{z}_j|H_{0,j})$ must never be precisely zero at any place where sample data can lie; otherwise, $J$, the ratio
of reference PDFs in (3), cannot be evaluated. This is a rather mild constraint, easily satisfied by the most common PDFs such as
Gaussian and exponential (assuming negative values are illegal). As long as this condition holds,
equation (3) will be a PDF and will be among the class of PDFs which give rise to the specified
feature PDF $p(\mathbf{z}_j|H_j)$ when transformed by the given feature transformation. That being said, there
are good and bad choices for $H_{0,j}$. A good choice of $H_{0,j}$ (one that will result in a good approximation
to $p(\mathbf{x}|H_j)$) is one for which the features $\mathbf{z}_j = T_j(\mathbf{x})$ are approximately sufficient statistics for testing
$H_j$ vs. $H_{0,j}$. Sufficiency is meant in the statistical sense, and does not mean “just good enough”. It
means that all information necessary to separate $H_{0,j}$ from $H_j$ is present. Remember, though, that
this condition is a goal, not a requirement, and should not discourage anyone from trying a particular
feature set. The closer the sufficiency condition can be approximated, the better the projected PDF
will approximate $p(\mathbf{x}|H_j)$. It is also advisable that $H_{0,j}$ be such that $p(\mathbf{x}|H_{0,j})$ and $p(\mathbf{z}_j|H_{0,j})$ can both
be determined either in closed form, or else to a good approximation, even in the (far) tails.
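The remark about the far tails has a practical side: a reference PDF evaluated directly can underflow to zero for strong signals, even though its logarithm is perfectly well defined. The sketch below uses a chi-square reference PDF with 2 degrees of freedom as an assumed example and compares a direct evaluation with a log-domain closed form.

```python
import math

def chi2_pdf_naive(z, dof):
    """Direct evaluation of the chi-square PDF; underflows for large z."""
    half = 0.5 * dof
    return z ** (half - 1.0) * math.exp(-0.5 * z) / (2.0 ** half * math.gamma(half))

def log_chi2_pdf(z, dof):
    """Closed-form log PDF, usable arbitrarily far into the tail."""
    half = 0.5 * dof
    return (half - 1.0) * math.log(z) - 0.5 * z - half * math.log(2.0) - math.lgamma(half)

z_tail = 4000.0                       # a feature value far in the tail under H0
naive = chi2_pdf_naive(z_tail, 2)     # underflows to exactly 0.0
stable = log_chi2_pdf(z_tail, 2)      # finite: -z/2 - log(2)
```

Dividing by the underflowed value would make the ratio in (3) impossible to evaluate, whereas the log-domain form keeps it finite. This is one reason to prefer reference hypotheses with closed-form PDFs.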


3.5 How to build a CSM classifier 

By substitution of (3) into the Bayesian classifier (1), the CSM classifier results:

$$j^* = \arg\max_{j=1,\ldots,M} \; \left[ \frac{p(\mathbf{x}|H_{0,j})}{p(\mathbf{z}_j|H_{0,j})} \right] p(\mathbf{z}_j|H_j)\, p(H_j). \qquad (4)$$

The ratio

$$J(\mathbf{x}, T_j, H_{0,j}) = \frac{p(\mathbf{x}|H_{0,j})}{p(\mathbf{z}_j|H_{0,j})} \qquad (5)$$

we call the “J-function”. It may be considered a generalized Jacobian, or the correction term necessary to
create the optimal Bayes classifier from the various feature PDFs.
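A minimal sketch of the classifier (4) in log form follows. It assumes two hypothetical classes that share one reference hypothesis (i.i.d. unit-variance Gaussian noise): class 0 is described by the energy feature and class 1 by a squared matched-filter output with an assumed template `W`. The feature PDFs are taken to be scaled chi-square densities with made-up scale parameters; in practice they would be estimated from training data.

```python
import math

def log_chi2_pdf(z, dof):
    """Log PDF of a chi-square variable with `dof` degrees of freedom."""
    half = 0.5 * dof
    return (half - 1.0) * math.log(z) - 0.5 * z - half * math.log(2.0) - math.lgamma(half)

def log_px_h0(x):
    """Log PDF of the raw data under H0: i.i.d. N(0, 1) samples."""
    return -0.5 * len(x) * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)

W = [0.5, 0.5, 0.5, 0.5]  # hypothetical unit-norm template for class 1

def energy(x):
    return sum(v * v for v in x)

def matched_filter(x):
    return sum(a * b for a, b in zip(W, x)) ** 2

# Per class: (transform T_j, degrees of freedom of z_j under H0,
#             assumed scale of z_j under H_j, prior probability).
# The energy feature has 4 degrees of freedom because len(x) == 4 here.
CLASSES = [
    (energy, 4, 4.0, 0.5),
    (matched_filter, 1, 26.0, 0.5),
]

def csm_scores(x):
    """Log of each term maximized in equation (4)."""
    scores = []
    for transform, dof, scale, prior in CLASSES:
        z = transform(x)
        log_j = log_px_h0(x) - log_chi2_pdf(z, dof)                  # log J-function
        log_pz_hj = log_chi2_pdf(z / scale, dof) - math.log(scale)   # feature PDF
        scores.append(log_j + log_pz_hj + math.log(prior))
    return scores

def csm_classify(x):
    scores = csm_scores(x)
    return max(range(len(scores)), key=scores.__getitem__)

label_template = csm_classify([2.5, 2.5, 2.5, 2.5])   # aligned with the template
label_noise = csm_classify([2.0, -2.0, 2.0, -1.0])    # broadband, not aligned
```

The two scores are comparable even though they come from features of different dimension and meaning, because the J-function has mapped both back to the raw data domain.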


3.6 When is CSM optimal? 

Clearly if the projected PDFs (3) are valid PDFs, no matter whether they are accurate approximations to the
desired PDFs $p(\mathbf{x}|H_j)$, the classifier (4) is a valid probabilistic classifier. Optimality occurs when the
projected PDFs are equal to the desired PDFs. This happens when (1) the estimated feature PDFs,
$p(\mathbf{z}_j|H_j)$, are equal to the true feature PDFs, and (2) the class-specific features, $\mathbf{z}_j = T_j(\mathbf{x})$, are
sufficient statistics for deciding between the given class $H_j$ and the chosen reference hypothesis $H_{0,j}$.
Because the designer can choose both $T_j(\cdot)$ and $H_{0,j}$, it is to the designer’s benefit to choose them
jointly to approximate this condition. Note also that $H_{0,j}$ must be chosen from those hypotheses for
which it is possible to solve for both $p(\mathbf{x}|H_{0,j})$ and $p(\mathbf{z}_j|H_{0,j})$. It is not always easy, but great strides
have been made in recent years in being able to solve for the feature PDFs for many useful types of
features [18], [19].

3.7 Why is CSM better than the classical approach? 

Both CSM and the classical approach have the same theoretical performance because they are both 
based on the optimal Bayesian classifier (1). Indeed, this is demonstrated in an experiment where 
a class-specific classifier was compared to a classical classifier using exactly the same features [11]. 
In the 9-class synthetic data experiment, the class-specific classifier used feature sets of dimension 
between 1 and 2, while the classical (full-dimensional) classifier operated on an 11-dimensional feature 
set (the union of the class-specific features). The performance was plotted as a function of the number 
of training samples and is repeated in Figure 5. It shows that although the maximum performance
of the classical classifier was the same, it required more than two orders of magnitude more training
data to achieve it. Now imagine that only about 100 samples were available - observe on the graph
the gap in performance that would exist. But the classical approach can never attain the promised
performance because it needs to form a common feature set where the PDFs are estimated. The
curse of dimensionality exacts a heavy toll on performance. For a given maximum feature dimension,
CSM can collect much more information from the raw data because it can divide the information up
according to class.

Figure 5: Classification performance of CSM compared with classical approach in a 9-class synthetic
data experiment. The classical approach used an 11-dimensional feature set composed of the union of
all the class-specific features.

3.8 The paradigm shift 

Those who have worked with the classical approach have difficulty changing over to CSM, which is an
entirely new paradigm. Someone trained to view features as carrying information to distinguish one
class from another may have a difficult time viewing features in a way that ignores the other classes. A
simple geometric example can illustrate the paradigm shift. Figure 6 shows a notional 4-class problem
involving sets of geometric shapes. The classical paradigm involves finding features that are able to
discriminate among the four classes. A list of six measurements or “features” is provided in the
yellow box on the right side of the figure. These six features are adequate for discrimination among
the four classes. The mathematical implementation of the classical feature paradigm involves the
maximization of the feature PDF:

$$j^* = \arg\max_{j} \; p(\mathbf{z}|H_j),$$

where $\mathbf{z} = T(\mathbf{x})$ is a common feature set.

Figure 7 illustrates the class-specific paradigm using a fixed reference hypothesis. The features
are required to discriminate each class from the common reference hypothesis. This is the original
formulation of CSM but has a number of difficulties arising from the use of a common fixed reference
hypothesis. The mathematical implementation of the fixed-reference class-specific paradigm involves
the maximization of the likelihood ratios

$$j^* = \arg\max_{j} \; \frac{p(\mathbf{z}_j|H_j)}{p(\mathbf{z}_j|H_0)},$$

where $\mathbf{z}_j = T_j(\mathbf{x})$, for $j = 1 \ldots M$, are class-specific feature sets.

Figure 6: The classical paradigm for a notional 4-class classification problem. The yellow box on the
right lists six measurements or “features” useful for classifying the four classes.
[Figure content: $H_1$: full equilateral triangles of arbitrary size and color. $H_2$: full equilateral
polygons of arbitrary color and number of sides. $H_3$: empty red equilateral triangles of arbitrary size.
$H_4$: full red triangles with fixed base length. Features needed to separate all classes: color, number
of sides, side 1, side 2, side 3, fullness.]

Figure 7: The class-specific paradigm for a fixed reference hypothesis. The features (in yellow boxes)
are required to discriminate each class from the common reference hypothesis. Note that fewer features
are required for the simpler binary problems.

Figure 8: The class-specific paradigm for class-specific reference hypotheses. The features are required
to discriminate each class from the corresponding class-specific reference hypothesis.
[Figure content: a maximum of 2 features is needed to separate each class from its class-dependent
reference hypothesis.]

Figure 8 illustrates the class-specific paradigm using class-specific reference hypotheses. Class-specific
features are not chosen to discriminate a class from other classes; they are chosen to discriminate
each class from the corresponding class-specific reference hypothesis, which may be regarded as
a special member of the class. In effect, this means the features are chosen to describe the class. It is
important to remember that discrimination happens automatically if each class is well described. In
effect, choosing features for description results in the same (or more) feature information content as
discrimination but it assigns the information only to those classes for which it is relevant. In spite of
this, it is a difficult paradigm shift to make for many people who have been taught to choose features
for discriminatory power. The mathematical implementation of the class-specific paradigm involves
the maximization of the projected PDFs

$$j^* = \arg\max_{j} \; p_p(\mathbf{x}|H_j),$$

where

$$p_p(\mathbf{x}|H_j) = \left[ \frac{p(\mathbf{x}|H_{0,j})}{p(\mathbf{z}_j|H_{0,j})} \right] p(\mathbf{z}_j|H_j),$$

where $\mathbf{z}_j = T_j(\mathbf{x})$, for $j = 1 \ldots M$, are class-specific feature sets and $H_{0,j}$, for $j = 1 \ldots M$, are
class-specific reference hypotheses.


3.9 Working in the raw data domain 

CSM creates decision boundaries in the raw data domain instead of in a common feature domain. This 
sounds troublesome at first. After all, the raw data dimension can be very large and we are interested
in reducing the dimension! But remember that PDF estimation happens on the low-dimensional
feature space, not in the raw data space. Projecting to the raw data domain is done for us by the 
J-function (5) which does not suffer from the dimensionality curse because it does not need to be found 
empirically. The J-function can be determined exactly by analysis of the feature transformations. 

There is a clear advantage to working in the raw data domain because it is a common ground 
where everything can be compared fairly. Interestingly, CSM is not the first attempt to work in 
the raw data domain. For example, Bishop [20] creates raw data PDFs, but his approach requires 
linear transformations to be tractable and amounts to something akin to principal component analysis 
(PCA). CSM, on the other hand, is completely general. The “ace in the hole” is the fact that the 
projected PDF is indeed a PDF and it depends only upon a few parameters - the parameters of the 
feature PDF and of the feature transformation and reference hypothesis. All of these parameters 
are “fair game” in a maximum likelihood maximization. Have you an idea for a better feature set? 
Compare it to the existing feature set based on the maximum likelihood principle. Have you an idea for 
a better reference hypothesis? Compare it to the existing reference hypothesis based on the maximum 
likelihood principle. This idea can be represented mathematically as


$$\max_{H_0, T, \theta} \left\{ L(\mathbf{x}_1 \ldots \mathbf{x}_K; H_0, T, \theta) = \prod_{k=1}^{K} \left[ \frac{p(\mathbf{x}_k|H_0)}{p_z(T(\mathbf{x}_k)|H_0)} \right] p_z(T(\mathbf{x}_k); \theta) \right\}, \qquad (6)$$

where $K$ is the number of independent data samples and the subscript $z$ is a reminder that the PDF $p_z(\cdot)$ is
a function of the features $\mathbf{z} = T(\mathbf{x})$. To avoid “over-training”, when implementing (6) in practice, it is
recommended that the data be partitioned into separate training and testing sets for cross-validation.
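The comparison in (6) can be sketched as follows. Two hypothetical candidate feature transformations are scored by the total projected log-likelihood on held-out data, with the feature PDF parameter $\theta$ fitted on a separate training set. The reference hypothesis is held fixed (i.i.d. unit-variance Gaussian noise), the feature PDFs are assumed to be scaled chi-square densities, and the data are simulated; all of these are assumptions of this example rather than part of the article.

```python
import math
import random

def log_chi2_pdf(z, dof):
    """Log PDF of a chi-square variable with `dof` degrees of freedom."""
    half = 0.5 * dof
    return (half - 1.0) * math.log(z) - 0.5 * z - half * math.log(2.0) - math.lgamma(half)

def log_px_h0(x):
    """Log PDF of the raw data under H0: i.i.d. N(0, 1) samples."""
    return -0.5 * len(x) * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)

def projected_loglik(records, transform, dof, scale):
    """Log of the product in (6) for a scaled chi-square feature model."""
    total = 0.0
    for x in records:
        z = transform(x)
        total += log_px_h0(x) - log_chi2_pdf(z, dof)             # log J-function
        total += log_chi2_pdf(z / scale, dof) - math.log(scale)  # log p_z(T(x); theta)
    return total

def fit_scale(records, transform, dof):
    """ML estimate of the scale theta for a scaled chi-square model: mean(z) / dof."""
    return sum(transform(x) for x in records) / (len(records) * dof)

def energy(x):
    return sum(v * v for v in x)

def first_sample_sq(x):
    return x[0] * x[0]

rng = random.Random(0)
N = 8
data = [[rng.gauss(0.0, 2.0) for _ in range(N)] for _ in range(400)]
train, test = data[:200], data[200:]

candidates = {"energy": (energy, N), "first_sample": (first_sample_sq, 1)}
heldout = {}
for name, (transform, dof) in candidates.items():
    theta = fit_scale(train, transform, dof)   # fit on the training set only
    heldout[name] = projected_loglik(test, transform, dof, theta)
best = max(heldout, key=heldout.get)
```

Because both scores are log-likelihoods of the same raw data, they can be compared directly even though the two features have different dimensions and distributions. Here the energy feature wins, since it captures the variance of all eight samples rather than one.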


3.10 Classifying without training. 

The PPT (3) is a decomposition of the raw data PDF into a trained and an untrained factor. The
trained factor is the feature PDF $p(\mathbf{z}_j|H_j)$, which needs to be estimated from training data of the
corresponding class. The untrained factor $[p(\mathbf{x}|H_{0,j}) / p(\mathbf{z}_j|H_{0,j})]$ is a known function of the input raw
data $\mathbf{x}$, feature transformation $T_j(\cdot)$, and reference hypothesis $H_{0,j}$. But, for a fixed $\mathbf{x}$, it also can be
viewed as a function of $j$. Thus, it contributes, sometimes in a dominant role, to the classification
decision. While the trained component asks “how does this sample compare with trained patterns?”,
the untrained component asks “how well does this feature set represent this raw data sample?”. The
untrained component can be seen as a generalization of a matched filter.

To see this, we consider a bank of linear matched filters as a set of class-specific feature extractors

$$z_j = T_j(\mathbf{x}) = |\mathbf{w}_j' \mathbf{x}|^2,$$

where $\mathbf{w}_j$ is a signal template. Let $\mathbf{w}_j$ be normalized such that $\mathbf{w}_j' \mathbf{w}_j = 1$. The simple matched filter
bank classifier is given by

$$j^* = \arg\max_{j} \; z_j.$$

Let us now design a class-specific classifier for these features. Under the reference hypothesis of
independent Gaussian noise of variance 1, $z_j$ is distributed $\chi^2(1)$,

$$p(z_j|H_0) = \frac{1}{\sqrt{2\pi z_j}} \exp\left( -\frac{z_j}{2} \right).$$

The PDF of $\mathbf{x}$ under $H_0$ is the Gaussian PDF

$$p(\mathbf{x}|H_0) = (2\pi)^{-N/2} \exp\left( -\frac{\mathbf{x}'\mathbf{x}}{2} \right),$$

where $N$ is the dimension of $\mathbf{x}$. The log-J-function is easily shown to be

$$\log J_j(\mathbf{x}, z_j) = \log p(\mathbf{x}|H_0) - \log p(z_j|H_0) = \frac{1}{2}\left( \log z_j + z_j \right) + C(\mathbf{x}),$$

where $C(\mathbf{x})$ does not depend on $j$. The complete class-specific classifier is

$$j^* = \arg\max_{j} \; \log J_j(\mathbf{x}, z_j) + \log p(z_j|H_j). \qquad (7)$$

Since $\log J_j$ is a monotonically increasing function of $z_j$, using only $\log J_j(\mathbf{x}, z_j)$ as a classifier is equivalent
to the matched filter bank classifier. Note that ignoring the last term in (7) effectively assumes
that each class has the same expected amplitude distribution.

It is clear that classification is quite possible without training as long as each class requires a
distinctly different feature set for representation. This idea should not be taken literally without some
care. Generalizing the “J-function-only” classifier to cases where the features are not matched filters
requires that some kind of a priori feature PDF be used to account for differences in feature
dimension and scaling. Note that this requirement is relaxed if the J-function is highly dominant.
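The matched-filter argument can be verified numerically. The sketch below assumes a small bank of hypothetical unit-norm templates and one made-up data vector. It computes the log-J-function for each template from the two reference PDFs, compares it with the closed form $\frac{1}{2}(\log z_j + z_j) + C(\mathbf{x})$, and confirms that ranking by the log-J-function agrees with ranking by $z_j$.

```python
import math

def log_j(x, w):
    """Return (log J_j, z_j) for z_j = |w' x|^2 under H0: i.i.d. N(0, 1) noise."""
    n = len(x)
    z = sum(a * b for a, b in zip(w, x)) ** 2
    log_px_h0 = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)
    log_pz_h0 = -0.5 * math.log(2.0 * math.pi * z) - 0.5 * z   # chi-square(1) PDF
    return log_px_h0 - log_pz_h0, z

# Hypothetical bank of unit-norm templates and a made-up data vector.
r = 1.0 / math.sqrt(2.0)
templates = [[1.0, 0.0, 0.0], [0.0, r, r], [r, -r, 0.0]]
x = [0.4, 1.9, 1.6]

results = [log_j(x, w) for w in templates]
log_js = [lj for lj, _ in results]
zs = [z for _, z in results]

# C(x) collects the terms that do not depend on j.
c_x = -0.5 * (len(x) - 1) * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)
closed_form = [0.5 * (math.log(z) + z) + c_x for z in zs]

best_by_j = max(range(len(templates)), key=log_js.__getitem__)
best_by_z = max(range(len(templates)), key=zs.__getitem__)
```

Both rankings select the same template, so the J-function alone reproduces the matched filter bank decision in this example without any trained feature PDF.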


4 The chain rule and the chain-rule processor 


As part of the paradigm shift from the classical architecture, we recommend looking at a sophisticated
general-purpose classifier as a bank of signal processors. Each signal processor may be thought of
as an optimal detector for differentiating the given class from the corresponding reference hypothesis.
Each signal processor may be composed of multiple processing stages. If we regard the feature
transformation $\mathbf{z} = T(\mathbf{x})$ as a single step, we write the projected PDF as

$$p_p(\mathbf{x}|H_1) = \left[ \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}|H_0)} \right] p(\mathbf{z}|H_1).$$

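However, if we regard the feature transformation as three separate stages, $\mathbf{y} = T'(\mathbf{x})$, $\mathbf{w} = T''(\mathbf{y})$,
then $\mathbf{z} = T'''(\mathbf{w})$, we may apply the PDF projection theorem recursively. For the first stage, we have

$$p_p(\mathbf{x}|H_1) = \left[ \frac{p(\mathbf{x}|H_0)}{p(\mathbf{y}|H_0)} \right] p_p(\mathbf{y}|H_1).$$

Applying the same concept to $p_p(\mathbf{y}|H_1)$, we have

$$p_p(\mathbf{y}|H_1) = \left[ \frac{p(\mathbf{y}|H_0')}{p(\mathbf{w}|H_0')} \right] p_p(\mathbf{w}|H_1),$$

and so on. The complete break-down is written

$$p_p(\mathbf{x}|H_1) = \left[ \frac{p(\mathbf{x}|H_0)}{p(\mathbf{y}|H_0)} \right] \left[ \frac{p(\mathbf{y}|H_0')}{p(\mathbf{w}|H_0')} \right] \left[ \frac{p(\mathbf{w}|H_0'')}{p(\mathbf{z}|H_0'')} \right] p(\mathbf{z}|H_1), \qquad (8)$$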


where $H_0$, $H_0'$, $H_0''$ are reference hypotheses suited to each stage in the processing chain. The advantage
of this approach is first that many processing chains may share the same first stages of processing, thus
saving processing. Furthermore, analyzing just one stage at a time simplifies the analysis. Finally,
there is great advantage in software modularity because each stage of processing can be encapsulated
as a module.

4.1 Feature Modules.


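Feature modules are pre-packaged software modules that contain both feature calculation and
J-function calculation. The three modules necessary for implementing the above three-stage chain
would be

$$\begin{aligned}
\text{Module 1:}\quad & \mathbf{y} = T'(\mathbf{x}), & j_1 &= \log \frac{p(\mathbf{x}|H_0)}{p(\mathbf{y}|H_0)}, \\
\text{Module 2:}\quad & \mathbf{w} = T''(\mathbf{y}), & j_2 &= \log \frac{p(\mathbf{y}|H_0')}{p(\mathbf{w}|H_0')}, \\
\text{Module 3:}\quad & \mathbf{z} = T'''(\mathbf{w}), & j_3 &= \log \frac{p(\mathbf{w}|H_0'')}{p(\mathbf{z}|H_0'')}.
\end{aligned}$$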


Completion of the processing chain is accomplished by accumulating the “correction terms”

$$\log p_p(\mathbf{x}|H_1) = j_1 + j_2 + j_3 + \log p(\mathbf{z}|H_1). \qquad (9)$$

Class-specific classifiers can be rapidly designed by stringing together chains of pre-designed modules
and accumulating the log J-function values.
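The modular structure behind (9) can be sketched with three small modules, each returning its output together with its log J-function value. The particular stages and reference hypotheses below are assumptions chosen because their PDFs are known in closed form: squaring under i.i.d. standard normal samples, summation under i.i.d. unit-mean exponential samples, and an invertible logarithm whose J-function is the ordinary Jacobian from (2). The Gaussian feature PDF at the end stands in for a trained PDF and uses made-up parameters.

```python
import math

def module_square(x):
    """Stage 1: y_i = x_i^2. H0: x i.i.d. N(0, 1), so each y_i is chi-square(1)."""
    y = [v * v for v in x]
    log_j = sum(0.5 * math.log(v) for v in y)   # log p(x|H0) - log p(y|H0)
    return y, log_j

def module_sum(y):
    """Stage 2: w = sum(y). H0': y i.i.d. Exp(1), so w is Gamma(N, 1)."""
    n = len(y)
    w = sum(y)
    log_py = -w
    log_pw = (n - 1) * math.log(w) - w - math.lgamma(n)
    return w, log_py - log_pw

def module_log(w):
    """Stage 3: z = log(w), invertible, so J is the Jacobian |dz/dw| = 1/w."""
    return math.log(w), -math.log(w)

def log_feature_pdf(z):
    """Hypothetical trained feature PDF: Gaussian with made-up mean and variance."""
    mean, var = 1.0, 0.25
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - mean) ** 2 / var

def chain_log_pdf(x, modules, feature_logpdf):
    """Equation (9): accumulate the correction terms, then add log p(z|H_1)."""
    value, total = x, 0.0
    for module in modules:
        value, log_j = module(value)
        total += log_j
    return total + feature_logpdf(value)

chain = [module_square, module_sum, module_log]
log_pp = chain_log_pdf([1.0, 1.0], chain, log_feature_pdf)
```

Another class could reuse `module_square` and branch afterwards, which is the sharing of early stages mentioned above. Inputs containing exact zeros would need special handling in this sketch, since the logarithms are then undefined.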

4.2 Classifier Architecture. 

Implementation of a classifier is illustrated in Figure 9. Each horizontal chain corresponds to one class. 
The chains are made up of series of modules. In accordance with equation (9), each module adds the 
corresponding correction term (J-function) to the stream. At the end, the aggregate J-function is 
added to the log feature PDF to arrive at the class output value. 

5 Building a Classifier 

Because CSM is new, there is a large learning curve for those being introduced to it. There are many 
difficulties and pitfalls associated with building a classifier that should be mentioned. 

5.1 Common Problems. 

The following is a list of problems and difficulties that are often encountered in designing and
implementing a class-specific classifier.

1. Sufficiency. Recall that the designer should, as a goal, strive for a feature set/reference hypothesis
combination where the features are approximately sufficient to discriminate the class of
interest from the reference hypothesis. Sufficiency does not mean “just enough”, i.e. sufficient
to get the job done. Sufficiency means that all of the information has been extracted for discrimination.
But this is a goal, not a requirement. It should not discourage anyone from using a
set of features that is reasonable. A common mistake is to leave out a significant amount of
information relating to the discrimination of a given class from the fixed reference hypothesis
simply because it is not necessary to discriminate the data most if not all of the time. Here’s
an example. Consider discriminating a sinewave in additive correlated noise from a reference
hypothesis of independent noise. While it may be adequate to concentrate on the sinewave, do
not lose sight of the fact that the background noise also is different from $H_0$ and can significantly
contribute to discrimination. It would be better in this case to use the correlated noise as the
reference hypothesis.

Figure 9: Block diagram of a class-specific classifier using chain rule processors.


2. Using “all classes” as a reference hypothesis. The suggestion that the reference hypothesis
$H_0$ can be defined as a combination of “all classes” has been made several times. While in
principle $H_0$ can be any hypothesis, even this one, it defeats the purpose of class-specific features.
This is because all the features are needed for discrimination of one class from all the others (see
the item above). Furthermore, the “all classes” hypothesis does not yield to the mathematical analysis needed to
compute the J-function.

3. Tail Probability Errors. A common misconception is that the denominator PDF in the 
J-function, $p(\mathbf{z}|H_0)$, can be estimated from training data. This is only true if all 
possible realizations of the input data fall within the central part of the distribution and are 
not highly unlikely under $H_0$. This could work, for example, with low-SNR signals. But such a 
system would perform poorly against high-SNR signals. It may be possible to position the reference 
hypothesis “close” to the data sample, then attempt to estimate the PDF of the features by random 
trials. Note that this would need to be redone for each sample to be tested. It also must meet the 
requirements for a variable reference hypothesis. 

4. Segmentation. Segmentation is the practice of carving up data into fixed-size segments, then 
extracting features from each segment. This is an important first step in processing. The choice 
of segment size is often difficult in traditional classifiers, because it is necessary to choose a 
segment size that is “good enough” for all classes. The class-specific method affords us the luxury 
of using a different segment size for each class. This is because the likelihood comparisons are 
made on the raw data, which is always the same. A common error arises here: because different 
segment sizes are used for different classes, the amount of raw data processed varies slightly from 
class to class whenever the input data size is not divisible by all segment sizes. This can be a 
fatal error. It is necessary to use only an input data record size that is exactly divisible by 
each segment size under consideration. 

5. Failure to Validate Analysis. Some form of absolute validation is necessary before using a 
module. Section 7 provides a method of validating the J-function analysis. There is no 
obvious way to locate errors except with this approach. 
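
The record-size requirement in the segmentation item above can be checked mechanically. The following is a minimal sketch, assuming hypothetical per-class segment sizes of 64, 96 and 128 samples; the function name `common_record_length` is illustrative and not part of any CSM library. It truncates the available record to the largest length that every segment size divides exactly, so that all class models see the same raw data.

```python
from functools import reduce
from math import gcd


def common_record_length(segment_sizes, available_samples):
    """Largest record length <= available_samples divisible by every segment size."""
    # Least common multiple of all segment sizes used by the class models.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), segment_sizes)
    usable = (available_samples // lcm) * lcm
    if usable == 0:
        raise ValueError("record is shorter than one common block of %d samples" % lcm)
    return usable


segment_sizes = [64, 96, 128]        # hypothetical per-class segment sizes
n_use = common_record_length(segment_sizes, 4500)
# Number of segments each class model would extract from the common record.
print(n_use, [n_use // s for s in segment_sizes])
```

Truncating to a common multiple is only one possible policy; choosing segment sizes that all divide a fixed record length (for example, powers of two) avoids discarding data.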

5.2 Module Design 

There is more than one method of module design. The designer should not give up on using a good 
set of features because one module design approach fails; there may be another that works. 

1. Fixed reference hypothesis. In this approach, a fixed reference hypothesis, such as 
independent Gaussian noise of a fixed variance, is chosen. Then, the numerator and denominator 
densities of the J-function must be known exactly or approximated with the saddlepoint 
approximation [18] to ensure accurate tail values. 

2. Floating reference hypothesis. Floating the reference hypothesis by positioning it “close” to 
the data sample to be tested is a means of avoiding the tails. In general, a reference hypothesis 
cannot be made dependent on the data, since this violates the concept of a statistical hypothesis. 
But under certain conditions, the dependence of the numerator and denominator of the J-function on 
changes in the reference hypothesis cancels out, making the approach feasible [1]. The reference 
hypothesis may be floated as a function of the data as long as the features are sufficient statistics 
to distinguish all the possible hypotheses that may result. Floating the hypothesis may be as 
simple as adjusting the variance of the Gaussian assumption to agree with the sample variance 
of the data. Or, it may be as sophisticated as controlling the noise spectrum of an autoregressive 
model to agree with the observed autocorrelation function. The designer must ensure that the 
features are sufficient or approximately sufficient to discriminate between the various reference 
hypotheses. For example, any feature set that contains the sample variance explicitly as a 
component, or from which the sample variance can be inferred, is fully sufficient to 
discriminate between any pair of variance hypotheses. Therefore, the variance of the reference 
hypothesis can be “floated”. 

3. On-the-fly analysis. It is possible to make a rapid Monte Carlo-type analysis of the feature 
PDFs under a floating reference hypothesis (see footnote 1). This is useful when the PDF of the 
features defies analysis. 
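
As an illustration of the on-the-fly approach, the sketch below floats the variance of a Gaussian reference hypothesis to the sample variance of the data, simulates raw data under that hypothesis, and estimates the feature PDF at the observed feature value with a kernel density estimate. Everything specific here is an assumption made for illustration: the sample-variance feature, the Gaussian kernel with a rule-of-thumb bandwidth, the trial count, and the function names. The closed-form Chi-squared density is included only so the Monte Carlo value can be compared against it.

```python
import math
import numpy as np


def feature(x):
    """Sample-variance feature z = mean(x**2)."""
    return float(np.mean(np.square(x)))


def mc_log_pdf_of_feature(z_obs, n_samples, sigma2, n_trials=20000, seed=0):
    """Monte Carlo estimate of log p(z_obs | H0(sigma2)) using a Gaussian kernel."""
    rng = np.random.default_rng(seed)
    # Simulate raw data under the floated reference hypothesis and extract features.
    x = rng.normal(0.0, math.sqrt(sigma2), size=(n_trials, n_samples))
    z = np.mean(x * x, axis=1)
    # Rule-of-thumb bandwidth for a one-dimensional kernel density estimate.
    h = 1.06 * z.std() * n_trials ** (-0.2)
    kernel = np.exp(-0.5 * ((z_obs - z) / h) ** 2) / (h * math.sqrt(2.0 * math.pi))
    return math.log(kernel.mean())


def exact_log_pdf_of_feature(z, n_samples, sigma2):
    """Closed form: z is Chi-squared with n_samples dof, scaled by sigma2 / n_samples."""
    u = n_samples * z / sigma2
    return (math.log(n_samples / sigma2) - math.lgamma(0.5 * n_samples)
            - 0.5 * n_samples * math.log(2.0)
            + (0.5 * n_samples - 1.0) * math.log(u) - 0.5 * u)


rng = np.random.default_rng(1)
x_obs = rng.normal(0.0, 30.0, size=16)   # a sample far from unit variance
z_obs = feature(x_obs)
sigma2 = z_obs                           # float the reference variance to the data
log_p_mc = mc_log_pdf_of_feature(z_obs, x_obs.size, sigma2)
log_p_exact = exact_log_pdf_of_feature(z_obs, x_obs.size, sigma2)
print(log_p_mc, log_p_exact)
```

Because the reference hypothesis is positioned at the data, the observed feature falls in the central part of the simulated distribution, which is where a sample-based estimate can be trusted. The simulation has to be repeated for every sample tested.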

5.3 NUWC Module Library 

The class-specific module is the building block of a class-specific classifier. It can be a source of 
frustration if a classifier designer wishes to use a feature set and cannot because no analysis is available. 
This is why a library of pre-tested class-specific modules is useful. A central repository of class-specific 
modules is being collected at a web-site at NUWC: 

http://www.npt.nuwc.navy.mil/csf/index.html 

To date, this collection includes the following feature transformations: 

1. Various invertible transformations. 

2. Spectrogram. 

3. Arbitrary linear functions of exponential RVs. 

4. Autocorrelation function (contiguous and non-contiguous). 

5. Autoregressive parameters (Reflection coefficients). 

6. Cepstrum (including MEL Cepstrum). 

7. Order statistics of independent RVs. 

8. Sets of quadratic forms. 

New feature modules may be designed using the analysis tools of CR bound analysis (for maximum 
likelihood features). Readers are encouraged to use the library and submit their own contributed 
modules. 

Footnote 1: The author wishes to thank Mario Fritz for this suggestion. 


6 Examples of Feature Chains 


Examples are necessary to make clear the important points discussed thus far. Each example shows 
how a feature transformation chain can be analyzed to obtain the correction term for PDF projection. 
When the feature transformation occurs in more than one step, the examples are broken down into 
separate modules. For each module, we provide the following information (all enclosed in boxes for 
clarity), 


Feature Calculation: The mathematical expression of the feature calculation. 


$H_0$: A description of the reference hypothesis. 


The class-specific correction term (J-function) is given by 

$$J(\mathbf{x}, T, H_0) = \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}|H_0)}.$$

We separately provide the numerator and denominator: 


Numerator PDF: The numerator PDF of the J-function. 


Denominator PDF: The denominator PDF of the J-function. 


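The simplest kind of feature transformation is an invertible transformation. While these are not 
useful for dimension reduction, they are important for feature conditioning. For invertible 
transformations, the J-function is just the absolute value of the determinant of the Jacobian matrix 
of the transformation. Thus, 

$$J(\mathbf{x}, T) = |\det(\mathbf{J})|,$$

where 

$$\mathbf{J} = \begin{bmatrix}
\frac{\partial z_1}{\partial x_1} & \frac{\partial z_1}{\partial x_2} & \frac{\partial z_1}{\partial x_3} & \cdots \\
\frac{\partial z_2}{\partial x_1} & \frac{\partial z_2}{\partial x_2} & \frac{\partial z_2}{\partial x_3} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}.$$

For invertible transformations, we provide the complete J-function only: 


J-Function (Jacobian): The log of the absolute value of the determinant of the Jacobian matrix. 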
6.1 Log Transformation 

An example of an invertible transformation is the log function. Consider the transformation 


Feature Calculation: $z_i = \log(x_i)$, $1 \le i \le M$. 

We have $dz/dx = 1/x$, thus $\log J = \log(1/x) = -\log x = -z$. For taking the logarithm of a 
vector of length $M$, we have 

J-Function (Jacobian): $\log J = -\sum_{i=1}^{M} z_i$. 
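
A short numerical check of this module is sketched below. The standard normal feature PDF is an assumption chosen only because the exact raw-data PDF is then known (the log-normal density); the function names are illustrative. Adding the log J-function to the log feature PDF should reproduce the exact log-PDF of the raw data.

```python
import math


def log_module(x):
    """Invertible module z_i = log(x_i); returns the features and log J = -sum(z)."""
    z = [math.log(v) for v in x]
    return z, -sum(z)


def log_normal_pdf(z):
    """log N(z; 0, 1), an assumed feature PDF used only for this check."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def log_lognormal_pdf(x):
    """Exact log-PDF of x = exp(z) when z ~ N(0, 1)."""
    return -math.log(x) - 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(x) ** 2


x = [0.3, 1.7, 4.2]
z, log_j = log_module(x)
# Projected raw-data log-PDF: correction term plus log feature PDF.
projected = log_j + sum(log_normal_pdf(v) for v in z)
exact = sum(log_lognormal_pdf(v) for v in x)
print(projected, exact)
```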

6.2 Variance Estimate 

A very simple example of a class-specific module is the sample variance. Let $\mathbf{x}$ be a 
time-series of length $N$ and let $z$ be the variance estimate 

Feature Calculation: $z = T(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} x_i^2$. 

Let the reference hypothesis be 

$H_0$: Independent zero-mean Gaussian noise of variance 1. 

Then the numerator of the J-function is 

Numerator PDF: $\log p(\mathbf{x}|H_0) = -\frac{N}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{N} x_i^2$. 

Since $z$ has the Chi-squared distribution with $N$ degrees of freedom (scaled by $1/N$), the 
denominator of the J-function is 

Denominator PDF: $\log p(z|H_0) = \log N - \log\Gamma(N/2) - \frac{N}{2}\log 2 + \left(\frac{N}{2} - 1\right)\log(Nz) - \frac{Nz}{2}$. 
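
The two expressions above translate directly into code. The sketch below is a minimal, assumed implementation (the name `variance_module` is illustrative); it returns the feature and the log J-function as the difference of the numerator and denominator log-PDFs.

```python
import math


def variance_module(x):
    """Sample-variance module under H0 = unit-variance Gaussian noise.

    Returns (z, log_j) with log_j = log p(x|H0) - log p(z|H0).
    """
    n = len(x)
    energy = sum(v * v for v in x)
    z = energy / n
    # Numerator: i.i.d. standard normal log-PDF of the raw data.
    log_num = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * energy
    # Denominator: Chi-squared with n degrees of freedom, scaled by 1/n.
    log_den = (math.log(n) - math.lgamma(0.5 * n) - 0.5 * n * math.log(2.0)
               + (0.5 * n - 1.0) * math.log(n * z) - 0.5 * n * z)
    return z, log_num - log_den


z, log_j = variance_module([3.0, -4.0])
print(z, log_j)
```

Note that the exponential terms of the numerator and denominator cancel, so the result stays finite and well behaved even for data far in the tails of $H_0$. For $N = 2$ the log J-function reduces to the constant $-\log 2\pi$, which gives a quick sanity check.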


6.3 Autocorrelation Function. 

A very useful feature set in stationary time-series analysis is the autocorrelation function (ACF). The 
ACF coefficient of lag $\tau$ is an estimate of the mean or expected value of the product 
$x_t x_{t-\tau}$, which for stationary time-series is independent of $t$. The ACF is the fundamental 
feature extraction behind many spectral estimation techniques with varying names such as linear 
predictive coding (LPC), autoregressive (AR) modeling, and reflection coefficients (RC). All of these 
methods are related and begin by estimating the ACF using a variety of methods. The benefit of AR 
modeling is that the spectral information can be boiled down to just a few coefficients that retain 
spectral information with high resolution. The first $P+1$ ACF lags ($\tau = 0, 1, \ldots, P$) are 
required for a $P$-th order AR model [21]. These ACF lags can then be transformed to RCs or AR 
coefficients using invertible transformations, thus they are equivalent from a modeling point of view. 
A good source of information on the topic is the book by Kay [21]. 

It may also be useful to use arbitrary ACF lags, rather than only the first $P+1$ lags. This is 
especially true when dealing with periodic time-series such as human voice, where the lag value at the 
pitch period is also of interest. Let $\mathbf{x} = [x_1, x_2, \ldots, x_N]$ be a time-series of 
length $N$. We define the $M$-dimensional feature set $\mathbf{z}$ as the arbitrary ACF lags 
$k_1, k_2, \ldots, k_M$. Thus, the feature calculation is 

Feature Calculation: $\mathbf{z} = [r_{k_1}, r_{k_2}, \ldots, r_{k_M}]$, where $r_k = \frac{1}{N}\sum_{i=1}^{N} x_i\, x_{[i+k]_N}$, 
where the brackets $[\cdot]_N$ indicate modulo-$N$ indexing. These are known as the circular ACF 
estimates because of the modulo indexing. We choose this form of the ACF because it simplifies the 
analysis. A solution is available for arbitrary forms of the ACF based on quadratic forms [19], but 
is more complicated. As before, let the reference hypothesis be 

$H_0$: Independent zero-mean Gaussian noise of variance 1. 

Then, as before, the numerator of the J-function is 

Numerator PDF: $\log p(\mathbf{x}|H_0) = -\frac{N}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{N} x_i^2$. 

There is no known closed-form expression for the joint PDF of $\mathbf{z}$ under $H_0$, although a 
cumbersome but exact expression is available for the normalized statistics $r_k/r_0$ (see [18], 
Section II.B). However, an approximation based on the saddlepoint approximation (SPA) [22] that is 
valid in the tails has been published. Specifically, in reference [18], Section IV.B, the SPA for the 
scaled ACF estimates $\tilde{\mathbf{z}} = 2N\mathbf{z}$ is derived. The J-function denominator is thus 

Denominator PDF: $\log p(\mathbf{z}|H_0) = M\log(2N) + \log p(\tilde{\mathbf{z}}|H_0)$, 

where $p(\tilde{\mathbf{z}}|H_0)$ is from reference [18], Section IV.B. 

6.4 Contiguous ACF and Reflection Coefficients. 

Reflection coefficients (RCs) are an alternate way of representing the information in an AR model. 
The RCs can be more convenient and easier to statistically model. Reflection coefficients (RCs) may 
be calculated from ACF estimates [21], and therefore we may use the results of Section 6.3 followed 
by a conversion to RCs. However, Section 6.3 is more general since it describes an approach that 
can handle arbitrary ACF lags; whereas the RCs are computed from a contiguous set of ACF lags 
(lags 0 through P ). The use of contiguous ACF samples allows a different approach to analysis of 
the ACF features which is both instructive and useful for comparison purposes. If we use the circular 
ACF estimates as before, we can calculate the ACF samples by first computing the magnitude-squared 
DFT, then the inverse DFT. A third stage is necessary to convert to RCs and a fourth stage is used for 
further conditioning. The complete chain provided below has been found to be extremely versatile in 
modeling time-series. By segmenting the time-series, signals can be converted into sequences of feature 
vectors that can be statistically modeled using a hidden Markov model (HMM). These feature 
sequences can also be converted back into time-series to validate the fidelity of the representation. 
As an additional check of model fidelity, the trained HMM can be used to generate random feature 
sequences, then converted into time-series for listening. Because of the versatility of CSM, each signal 
type can be represented using a particular choice of segment size and AR model order. 

6.4.1 Stage 1: Magnitude-Squared DFT 

In the first stage, we let $\mathbf{y} = [y_0, y_1, \ldots, y_{N/2}]$ be the magnitude-squared DFT 
of $\mathbf{x}$, 


Feature Calculation: $y_k = \left|\sum_{i=1}^{N} x_i \exp\left(-\frac{j 2\pi (i-1) k}{N}\right)\right|^2$, $k = 0, 1, \ldots, N/2$. 


As before, we let the reference hypothesis be 


$H_0$: Independent zero-mean Gaussian noise of variance 1. 
Also, as before, the numerator of the J-function is 


Numerator PDF: $\log p(\mathbf{x}|H_0) = -\frac{N}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{N} x_i^2$. 

The DFT bins are independent under $H_0$, but not identically distributed. DFT bins 0 and $N/2$ are 
real-valued, so $y_k$ has the Chi-squared distribution with 1 degree of freedom scaled by $N$, which 
we denote by $p_0(y)$: 

$$p_0(y) = \frac{1}{\sqrt{2\pi N y}}\exp\left(-\frac{y}{2N}\right).$$

DFT bins 1 through $N/2 - 1$ are complex, so $y_k$ has the Chi-squared distribution with 2 degrees 
of freedom scaled by $N/2$, which we denote by $p_1(y)$: 

$$p_1(y) = \frac{1}{N}\exp\left(-\frac{y}{N}\right).$$

The complete denominator PDF is 

Denominator PDF: $\log p(\mathbf{y}|H_0) = \log p_0(y_0) + \sum_{k=1}^{N/2-1}\log p_1(y_k) + \log p_0(y_{N/2})$. 
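
The stage 1 module can be sketched as follows, assuming an even record length and using NumPy's real-input FFT to obtain bins 0 through $N/2$. The function name is illustrative. The densities coded here are the scaled Chi-squared forms described above (1 degree of freedom scaled by $N$ for the two real bins, 2 degrees of freedom scaled by $N/2$ for the complex bins).

```python
import math
import numpy as np


def mag_squared_dft_module(x):
    """Magnitude-squared DFT module under H0 = unit-variance Gaussian noise.

    Returns (y, log_j) where y holds bins 0..N/2 and log_j = log p(x|H0) - log p(y|H0).
    """
    x = np.asarray(x, dtype=float)
    n = x.size                                   # assumed even
    y = np.abs(np.fft.rfft(x)) ** 2              # bins 0 .. N/2
    log_num = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * float(np.sum(x * x))
    # Real-valued bins 0 and N/2: Chi-squared, 1 dof, scaled by N.
    ends = y[[0, -1]]
    log_p0 = -0.5 * np.log(2.0 * math.pi * n * ends) - ends / (2.0 * n)
    # Complex bins 1 .. N/2-1: Chi-squared, 2 dof, scaled by N/2 (exponential, mean N).
    mid = y[1:-1]
    log_p1 = -math.log(n) - mid / n
    log_den = float(np.sum(log_p0) + np.sum(log_p1))
    return y, log_num - log_den


y, log_j = mag_squared_dft_module([3.0, 1.0])
print(y, log_j)
```

For $N = 2$ both bins are real, $y_0 = (x_1 + x_2)^2$ and $y_1 = (x_1 - x_2)^2$, and the log J-function works out to $\log 2 + \frac{1}{2}\log(y_0 y_1)$, which the example reproduces. A bin that is exactly zero would make the logarithm diverge, so a practical implementation may need to guard against that case.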


6.4.2 Stage 2: Inverse DFT 

In the second step, let $\mathbf{r} = [r_0, r_1, \ldots, r_P]$ be the first $P+1$ ACF lags, which 
can be computed from $1/N$ times the first $P+1$ samples of the real part of the inverse DFT of 
$\mathbf{y}$. This may be written as 

$$r_k = \frac{1}{N^2}\sum_{i=0}^{N/2} \epsilon_i\, y_i \cos\left(\frac{2\pi i k}{N}\right), \quad k = 0, 1, \ldots, P,$$

where $\epsilon_i = 1$ for $i = 0, N/2$, and $\epsilon_i = 2$ for $i = 1, 2, \ldots, N/2 - 1$. This 
may be written in the matrix form 

Feature Calculation: $\mathbf{r} = \mathbf{C}'\mathbf{y}$, 

where matrix $\mathbf{C}$ is defined accordingly. 
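
To make the matrix form concrete, the sketch below builds one possible $\mathbf{C}$ (an assumed layout with one row per DFT bin and one column per lag) and checks that $\mathbf{C}'\mathbf{y}$ agrees with the circular ACF computed directly in the time domain. The function names and the random test signal are illustrative.

```python
import math
import numpy as np


def acf_matrix(n, p):
    """Matrix C of shape (N/2+1, P+1) such that r = C' y for even N."""
    i = np.arange(n // 2 + 1)[:, None]           # DFT bin index
    k = np.arange(p + 1)[None, :]                # ACF lag index
    eps = np.full((n // 2 + 1, 1), 2.0)
    eps[0, 0] = eps[-1, 0] = 1.0                 # bins 0 and N/2 are counted once
    return eps * np.cos(2.0 * math.pi * i * k / n) / n ** 2


def circular_acf(x, p):
    """Direct circular ACF: r_k = (1/N) * sum_i x_i * x_[(i+k) mod N]."""
    x = np.asarray(x, dtype=float)
    return np.array([np.mean(x * np.roll(x, -k)) for k in range(p + 1)])


rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = np.abs(np.fft.rfft(x)) ** 2                  # stage 1 output, bins 0..N/2
r_from_dft = acf_matrix(x.size, 4).T @ y         # stage 2: r = C' y
r_direct = circular_acf(x, 4)
print(np.max(np.abs(r_from_dft - r_direct)))
```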

Now, for the first time, we use a reference hypothesis other than independent Gaussian noise. 
In fact, we use a floating reference hypothesis, one that depends upon the data sample. The use 
of a floating reference hypothesis and the constraints on how it may vary are discussed elsewhere 
[1]. The floating reference hypothesis is the AR spectrum corresponding to the ACF $\mathbf{r}$. 
Using the Levinson-Durbin recursion [21], we may transform $\mathbf{r}$ into the AR coefficients 
$\{a_1, a_2, \ldots, a_P, \sigma^2\}$. The corresponding AR spectrum is written $y^r_k$, 
$k = 0, 1, \ldots, N/2$, where the superscript “r” is a reminder that the AR spectrum depends on 
$\mathbf{r}$. We let our reference hypothesis, denoted by $H_0(\mathbf{r})$, be that the mean of 
$\mathbf{y}$ equals the AR spectrum $\mathbf{y}^r = [y^r_0, y^r_1, \ldots, y^r_{N/2}]$. 

For simplicity, we assume the elements of $\mathbf{y}$ are independent. 

$H_0(\mathbf{r})$: That $\mathbf{y}$ has independent elements with mean $E(\mathbf{y}) = \mathbf{y}^r$. 
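Under $H_0(\mathbf{r})$, the elements of $\mathbf{y}$ are independent and Chi-squared with 1 or 2 
degrees of freedom with mean $y^r_k$. Bins $k = 0, N/2$ are distributed according to 

$$p_0(y, y^r) = \frac{1}{\sqrt{2\pi\, y\, y^r}}\exp\left(-\frac{y}{2 y^r}\right),$$

while bins 1 through $N/2 - 1$ are exponentially distributed according to 

$$p_1(y, y^r) = \frac{1}{y^r}\exp\left(-\frac{y}{y^r}\right).$$

In summary, 

Numerator PDF: $\log p(\mathbf{y}|H_0(\mathbf{r})) = \log p_0(y_0, y^r_0) + \sum_{k=1}^{N/2-1}\log p_1(y_k, y^r_k) + \log p_0(y_{N/2}, y^r_{N/2})$. 

We may use the central limit theorem (CLT) to approximate the PDF of $\mathbf{r}$ under 
$H_0(\mathbf{r})$ because the mean of $\mathbf{r}$ under $H_0(\mathbf{r})$ is very nearly 
$\mathbf{r}$ itself. Under $H_0(\mathbf{r})$, the elements of $\mathbf{y}$ are independent with 
mean $\mathbf{y}^r$ and diagonal covariance $\Sigma^r_y$ given by 

$$\Sigma^r_y(i,i) = E\{(y_i - y^r_i)^2 \,|\, H_0(\mathbf{r})\} = \begin{cases} 2\,(y^r_i)^2, & i = 0,\ N/2 \\ (y^r_i)^2, & 1 \le i \le N/2 - 1. \end{cases}$$

Under $H_0(\mathbf{r})$, $\mathbf{r}$ has mean 

$$\mathbf{r}^r = E(\mathbf{r}|H_0(\mathbf{r})) = \mathbf{C}'\mathbf{y}^r,$$

and covariance 

$$\Sigma^r_r = \mathbf{C}'\,\Sigma^r_y\,\mathbf{C},$$

so that, by the CLT, 

$$\log p(\mathbf{r}|H_0(\mathbf{r})) = -\frac{P+1}{2}\log(2\pi) - \frac{1}{2}\log|\det(\Sigma^r_r)| - \frac{1}{2}(\mathbf{r} - \mathbf{r}^r)'\,(\Sigma^r_r)^{-1}(\mathbf{r} - \mathbf{r}^r). \quad (10)$$

If we make the approximation $\mathbf{r}^r \simeq \mathbf{r}$, we obtain 

Denominator PDF: $\log p(\mathbf{r}|H_0(\mathbf{r})) = -\frac{P+1}{2}\log(2\pi) - \frac{1}{2}\log|\det(\Sigma^r_r)|$. 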
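
The denominator PDF of this stage can be sketched as below. The reference spectrum is passed in as an array, since how it is obtained from the AR coefficients is outside this sketch; the layout of $\mathbf{C}$ and the function names are assumptions. The usage example takes a flat reference spectrum equal to $N$ (the mean of the magnitude-squared DFT of unit-variance white noise), for which the covariance of $\mathbf{r}$ is diagonal with entries $2/N$ for lag 0 and $1/N$ for the other lags.

```python
import math
import numpy as np


def acf_matrix(n, p):
    """Matrix C of shape (N/2+1, P+1) such that r = C' y for even N."""
    i = np.arange(n // 2 + 1)[:, None]
    k = np.arange(p + 1)[None, :]
    eps = np.full((n // 2 + 1, 1), 2.0)
    eps[0, 0] = eps[-1, 0] = 1.0
    return eps * np.cos(2.0 * math.pi * i * k / n) / n ** 2


def stage2_log_denominator(y_ref, p):
    """CLT approximation of log p(r | H0(r)), evaluated at r equal to its mean."""
    y_ref = np.asarray(y_ref, dtype=float)
    n = 2 * (y_ref.size - 1)
    c = acf_matrix(n, p)
    # Diagonal covariance of y: 2*(y_ref)^2 for the real bins, (y_ref)^2 otherwise.
    var_y = y_ref ** 2
    var_y[[0, -1]] *= 2.0
    cov_r = c.T @ (var_y[:, None] * c)
    _, logdet = np.linalg.slogdet(cov_r)
    return -0.5 * (p + 1) * math.log(2.0 * math.pi) - 0.5 * logdet


n, p = 32, 4
value = stage2_log_denominator(np.full(n // 2 + 1, float(n)), p)
expected = (-0.5 * (p + 1) * math.log(2.0 * math.pi)
            - 0.5 * (math.log(2.0 / n) + p * math.log(1.0 / n)))
print(value, expected)
```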


6.4.3 Stage 3: Conversion to RCs. 

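The conversion from ACF to RCs is an invertible transformation that is characterized by a Jacobian 
matrix. The absolute value of the determinant of this matrix is the J-function of the transformation. 

Feature Calculation: $\mathbf{r} \to$ (Levinson recursion for reflection coefficients) $\to \mathbf{k}$, 

where $\mathbf{r}$ is the ACF vector, $\mathbf{r} = [r_0, r_1, \ldots, r_P]$, and $\mathbf{k}$ is 
the vector of reflection coefficients augmented by the variance (zero-th lag ACF sample), 

$$\mathbf{k} = [r_0, k_1, \ldots, k_P].$$

Note that we use $r_0$ and not the AR prediction error variance $\sigma^2$. This transformation is 
invertible. Because $k_i$ depends only on $r_0, \ldots, r_i$, with $|\partial k_i/\partial r_i| = 1/E_{i-1}$ 
where $E_{i-1} = r_0\prod_{j=1}^{i-1}(1 - k_j^2)$ is the prediction error of order $i-1$, the 
Jacobian matrix is triangular and the transformation is characterized by the Jacobian 

J-function (Jacobian): $\log J = -P\log(r_0) - \sum_{i=1}^{P-1}(P - i)\log(1 - k_i^2)$. 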
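
The Jacobian of this stage can be checked numerically. The sketch below implements a standard Levinson recursion (with the sign convention $k_i = -(\ldots)/E_{i-1}$, which is an assumption; the opposite convention gives the same absolute determinant), evaluates the log-determinant from the triangular structure of the map, and compares it with a finite-difference Jacobian. The ACF values are hypothetical.

```python
import math
import numpy as np


def acf_to_rc(r):
    """Levinson recursion: ACF lags r_0..r_P -> [r_0, k_1, ..., k_P]."""
    r = np.asarray(r, dtype=float)
    p = r.size - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error of order 0
    out = [r[0]]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                       # prediction error of order i
        out.append(k)
    return np.array(out)


def log_j_closed_form(z):
    """log|det dk/dr| from the triangular structure of the Levinson map."""
    r0, k = z[0], z[1:]
    p = k.size
    return (-p * math.log(r0)
            - sum((p - i) * math.log(1.0 - k[i - 1] ** 2) for i in range(1, p)))


def log_j_numeric(r, step=1e-6):
    """Central-difference Jacobian of acf_to_rc, then log|det|."""
    r = np.asarray(r, dtype=float)
    jac = np.zeros((r.size, r.size))
    for j in range(r.size):
        d = np.zeros(r.size)
        d[j] = step
        jac[:, j] = (acf_to_rc(r + d) - acf_to_rc(r - d)) / (2.0 * step)
    return np.linalg.slogdet(jac)[1]


r = [2.0, 1.2, 0.5, 0.1]                         # hypothetical valid ACF, P = 3
z = acf_to_rc(r)
print(log_j_closed_form(z), log_j_numeric(r))
```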
6.4.4 Stage 4: Log-Bilinear Transformation. 

Although the RCs have desirable properties as features, they are subject to the limit $|k_i| < 1$, 
which produces a discontinuity in the PDF. As a result, the PDF can be difficult to estimate using 
so-called non-parametric PDF estimators such as Gaussian mixtures. To obtain more Gaussian behavior, 
the log-bilinear transformation is recommended (thanks to S. Kay for recommending this). 

Feature Calculation: $k'_i = \log\left(\frac{1 - k_i}{1 + k_i}\right)$, $1 \le i \le P$, $r'_0 = \log(r_0)$. 

This transformation is invertible and is characterized by the Jacobian 

J-function (Jacobian): $\log J = -r'_0 - \sum_{i=1}^{P}\log\left(\frac{1 - k_i^2}{2}\right)$. 
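
A minimal sketch of this conditioning stage follows, assuming the log-bilinear map $k' = \log((1-k)/(1+k))$; using the reciprocal inside the logarithm would flip the sign of $k'$ but leave the J-function unchanged. The inverse map is included because it is what re-synthesis of a time-series from features would need. Function names and input values are illustrative.

```python
import math


def log_bilinear_module(z):
    """[r_0, k_1..k_P] -> [log r_0, log((1-k_i)/(1+k_i))]; also returns log J."""
    r0, ks = z[0], z[1:]
    out = [math.log(r0)] + [math.log((1.0 - k) / (1.0 + k)) for k in ks]
    # |d r0'/d r0| = 1/r0 and |d k'/d k| = 2/(1 - k^2).
    log_j = -out[0] - sum(math.log((1.0 - k * k) / 2.0) for k in ks)
    return out, log_j


def inverse_log_bilinear(zp):
    """Inverse map: r_0 = exp(r_0'), k = (1 - e^{k'})/(1 + e^{k'}) = -tanh(k'/2)."""
    return [math.exp(zp[0])] + [-math.tanh(0.5 * v) for v in zp[1:]]


z = [2.0, 0.5, -0.3]                 # hypothetical [r_0, k_1, k_2]
zp, log_j = log_bilinear_module(z)
z_back = inverse_log_bilinear(zp)
print(zp, log_j, z_back)
```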



7 Experimental Validation 

A very important question in developing a class-specific classifier is how to validate the analysis of 
a feature transformation. Because the numerator and denominator PDFs of the J-function are often 
evaluated in the far tails, we can never know if these PDF values are correct by histogram techniques. 
In Sections 6.3 and 6.4 (up to stage 2), two methods are presented for calculating the J-function for 
ACF samples. It may be verified that for contiguous ACF samples, the two approaches produce exactly 
the same features. The J-function values produced by the two methods are very close, but not exactly 
the same. Such comparisons are reassuring but are not a complete test and cannot be made for all 
problems. The following approach is a complete end-to-end test that has proved to be very useful. 

Validation of the feature modules amounts to validating the PDF projection theorem (3). To 
validate equation (3), we design a hypothesis $H_v$ for which we know the PDF $p(\mathbf{x}|H_v)$ 
exactly and for which we can create a large number of synthetic raw data samples. We convert the 
synthetic data to features, which we use to obtain the PDF estimate $\hat{p}(\mathbf{z}|H_v)$. Using 
this estimate in equation (3), we obtain an estimate $\hat{p}_p(\mathbf{x}|H_v)$. To validate the 
result, we plot the projected PDF values $\hat{p}_p(\mathbf{x}|H_v)$ on one axis and the exact values 
$p(\mathbf{x}|H_v)$ on the other axis for each sample of synthetic data. The points should lie near 
the $y = x$ line. An example is shown in Figure 10 where we tested the chain of four feature modules 
in Section 6.4. The synthetic data used in the experiment were 100 time-series of independent 
Gaussian noise of variance 100 and length 4096. The features were computed using an AR model order 
of 4 with segmentation to 64-sample segments, thus producing 64 independent feature vectors of 
dimension 5 per sample. A Gaussian mixture model was used to statistically model the features. 
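
The validation procedure can be illustrated on a much smaller problem than the four-module chain of Figure 10. The sketch below is not the experiment reported here: it assumes the single sample-variance module of Section 6.2, a validation hypothesis $H_v$ of Gaussian noise with variance 4, and a single Gaussian fitted to the log of the feature in place of a Gaussian mixture. It computes the projected and exact log-PDF values whose scatter plot should lie near the $y = x$ line.

```python
import math
import numpy as np


def variance_module(x):
    """Sample-variance module under H0 = unit-variance Gaussian noise: (z, log J)."""
    n = x.size
    energy = float(np.sum(x * x))
    z = energy / n
    log_num = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * energy
    log_den = (math.log(n) - math.lgamma(0.5 * n) - 0.5 * n * math.log(2.0)
               + (0.5 * n - 1.0) * math.log(n * z) - 0.5 * n * z)
    return z, log_num - log_den


n, sigma2_v = 64, 4.0                            # hypothetical validation hypothesis H_v
rng = np.random.default_rng(0)

# Step 1: estimate the feature PDF under H_v from synthetic training data.
train = rng.normal(0.0, math.sqrt(sigma2_v), size=(2000, n))
u = np.log(np.mean(train * train, axis=1))
mu, sd = u.mean(), u.std()


def log_pdf_feature_estimate(z):
    """Gaussian fit in the log domain, mapped back to z (the -log z term is the Jacobian)."""
    t = (math.log(z) - mu) / sd
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * t * t - math.log(z)


# Step 2: project to the raw-data space and compare with the exact PDF under H_v.
projected, exact = [], []
for x in rng.normal(0.0, math.sqrt(sigma2_v), size=(100, n)):
    z, log_j = variance_module(x)
    projected.append(log_j + log_pdf_feature_estimate(z))
    exact.append(-0.5 * n * math.log(2.0 * math.pi * sigma2_v)
                 - 0.5 * float(np.sum(x * x)) / sigma2_v)

projected, exact = np.array(projected), np.array(exact)
mean_abs_err = float(np.mean(np.abs(projected - exact)))
corr = float(np.corrcoef(projected, exact)[0, 1])
print(mean_abs_err, corr)
```

An error in either the numerator or the denominator of the J-function would show up as a systematic departure from the $y = x$ line, which is what makes the test end-to-end.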

8 A Class-Specific Time-Series Classifier Using Reflection Coefficients and HMM. 

We can put to use the material discussed thus far to arrive at a fully modular, extremely versatile 
class-specific classifier. A functional block diagram of this classifier is provided in Figure 11. A 
given time-series is processed by each class-model to arrive at a raw-data log-likelihood for the class. 
Each block labeled “RC(P)” computes the reflection coefficients of order P from the associated 
time-series segment. The figure shows two class-models employing different segmentation lengths as 
well as different model orders. The log-correction terms of all the segments are added together and 
the aggregate correction term is added to the HMM log-likelihood (from the forward procedure [23]) to 
arrive at the final raw data log-likelihood for the class. The segmentation sizes and model orders are 
optimized for each class individually, eliminating the need to “compromise”. 

Figure 10: Example of validation test results for 4-th order autoregressive features (Section 6.4 stages 
1-4). The upper graph shows theoretical log-PDF values on the x-axis and PDF projection theorem 
values on the y-axis for 100 synthetic events. The lower graph shows the errors. 

Each “RC(P)” block is composed of a series of modules implementing ACF calculation followed by 
conversion to RCs, and ending with feature conditioning by the log-bilinear transformation. This may 
be implemented by the three modules described in Sections 6.3, 6.4.3, and 6.4.4. Alternatively, the 
four modules of Sections 6.4.1, 6.4.2, 6.4.3, and 6.4.4 will produce virtually identical features and 
J-function values. This classifier has the added benefit that the models may be validated by 
re-synthesis of time-series from features (either computed from actual data or generated at random 
by the HMM). 

It should be stressed that we are not limited to using RC features and HMM PDF models. As 
long as care is taken in computing the correction terms, any feature set and any statistical model may 
be employed. Straight DFT features may be preferable to RC features for sinusoidal signals. Wavelet 
features may be preferable for certain other types of signals. A particularly good feature set for 
DFT (or wavelet) processing consists of the largest M bins and the residual energy. The correction 
term for this feature set has been worked out by Nuttall [24]. Nuttall has also derived the 
correction term for features that may be written as a set of inter-dependent quadratic forms [25]. 

9 Conclusion 

Previous to the class-specific method, practitioners in image or signal classification had no guidance 
from classical theory in dealing with complex problems. The incomplete theory forced practitioners to 
think of feature extraction from the point of view of class separability. This flawed paradigm led the 
practitioner down the slippery slope of high dimensionality. Now that the reader has been introduced 
to the fundamental concepts of classification theory using class-specific features, he or she has the 
tools necessary to attack classification problems one class at a time, capturing all the necessary 
information in the features and not being forced to “make do” with features that are general enough 
for all classes, but not sufficient for any class. The examples provided are enough to build a simple, 
yet effective class-specific time-series classifier. 

Figure 11: Block diagram of an HMM and RC-based class-specific classifier. A given time-series is 
processed by each class-model to arrive at a raw-data log-likelihood for the class. Each block labeled 
“RC(P)” computes the P-th order reflection coefficients from the corresponding time-series segment 
and is implemented by a series of modules (See text). 